 bessel function





Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment

arXiv.org Artificial Intelligence

Multimodal contrastive learning (MCL) aims to embed data from different modalities in a shared embedding space. However, empirical evidence shows that representations from different modalities occupy completely separate regions of the embedding space, a phenomenon referred to as the modality gap. Moreover, experimental findings on how the size of the modality gap influences downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) How does it affect downstream tasks? To address these questions, this paper introduces the first theoretical framework for analyzing the convergent optimal representations of MCL and the modality alignment reached when training is optimized. Specifically, we prove that without any constraint or under the cone constraint, the modality gap converges to zero. Under the subspace constraint (i.e., representations of the two modalities fall into two distinct hyperplanes due to dimension collapse), the modality gap converges to the smallest angle between the two hyperplanes. This result identifies dimension collapse as the fundamental origin of the modality gap. Furthermore, our theorems demonstrate that paired samples cannot be perfectly aligned under the subspace constraint. The modality gap influences downstream performance by affecting the alignment between sample pairs. We prove that, in this case, perfect alignment between two modalities can still be achieved in two ways: hyperplane rotation and shared space projection.
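To make "the smallest angle between the two hyperplanes" concrete, here is a minimal sketch that computes the smallest principal angle between two linear subspaces with NumPy. It assumes that the angle meant in the abstract is the smallest principal angle between subspaces through the origin, which the abstract does not state; the function name `smallest_principal_angle` and the toy subspaces in R^4 are illustrative and not taken from the paper.

```python
import numpy as np


def smallest_principal_angle(A, B):
    """Smallest principal angle (radians) between the column spaces of A and B.

    Assumption: each subspace passes through the origin and is given by a
    matrix whose columns are linearly independent spanning vectors.
    """
    # Orthonormal bases for both subspaces (reduced QR).
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # The largest cosine corresponds to the smallest angle.
    return float(np.arccos(np.clip(cosines.max(), -1.0, 1.0)))


# Toy example: two 2-D subspaces of R^4 that meet only at the origin.
t, p = np.deg2rad(30.0), np.deg2rad(60.0)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])
B = np.array([[np.cos(t), 0.0],
              [0.0, np.cos(p)],
              [np.sin(t), 0.0],
              [0.0, np.sin(p)]])

gap_angle = smallest_principal_angle(A, B)
print(np.rad2deg(gap_angle))
```

In this toy setup the two principal angles are 30 and 60 degrees by construction, so the smallest one is 30 degrees. The example uses R^4 because two planes through the origin in R^3 always share a line, which would make the smallest angle zero.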





Equivariance by Local Canonicalization: A Matter of Representation

arXiv.org Artificial Intelligence

Equivariant neural networks offer strong inductive biases for learning from molecular and geometric data but often rely on specialized, computationally expensive tensor operations. We present a framework that transfers existing tensor field networks into the more efficient local canonicalization paradigm, preserving equivariance while significantly improving the runtime. Within this framework, we systematically compare different equivariant representations in terms of theoretical complexity, empirical runtime, and predictive accuracy. We publish the tensor frames package, a PyTorchGeometric-based implementation of local canonicalization that enables straightforward integration of equivariance into any standard message passing neural network.
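The idea behind local canonicalization can be illustrated with a small NumPy sketch: build an orthonormal frame at a node from its neighbors, then express vector features in that frame so they no longer change when the whole input is rotated. This is a hand-written illustration under stated assumptions, not the API of the tensor frames package; the helpers `local_frame`, `canonicalize`, and `random_rotation`, and the Gram-Schmidt frame construction from two neighbors, are all hypothetical choices made here.

```python
import numpy as np


def local_frame(center, n1, n2):
    """Right-handed orthonormal frame (as rows) from two neighbor positions.

    Assumption: the two neighbor offsets are not collinear.
    """
    u = n1 - center
    v = n2 - center
    e1 = u / np.linalg.norm(u)
    v = v - (v @ e1) * e1          # Gram-Schmidt: remove the e1 component
    e2 = v / np.linalg.norm(v)
    e3 = np.cross(e1, e2)          # completes a right-handed frame
    return np.stack([e1, e2, e3])


def canonicalize(frame, vec):
    """Coordinates of a global vector feature in the local frame."""
    return frame @ vec


def random_rotation(rng):
    """Random proper rotation (determinant +1) from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q


rng = np.random.default_rng(0)
center, n1, n2 = rng.normal(size=(3, 3))   # three points in R^3
feature = rng.normal(size=3)               # a vector feature at the node

R = random_rotation(rng)
local_before = canonicalize(local_frame(center, n1, n2), feature)
local_after = canonicalize(local_frame(R @ center, R @ n1, R @ n2), R @ feature)

# The canonicalized feature is unchanged by the global rotation.
print(np.allclose(local_before, local_after))
```

Because the frame rotates together with the input, the coordinates in the local frame are rotation invariant, and an ordinary message passing network can process them without specialized tensor operations.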


A Auxiliary Lemmas

Neural Information Processing Systems

We present some preliminary lemmas in this section. Most of them are basic inequalities in Information Theory, so they are only for auxiliary purposes in our proofs of the main theorems and the corollaries. We refer to Lemma 6.2 of the book Gray (2011). We refer to Theorem 14 of the article Liese and Vajda (2006). Assume that there are M disjoint / 2 -balls.


Achieving Rotational Invariance with Bessel-Convolutional Neural Networks

Neural Information Processing Systems

As of today, Convolutional Neural Networks (CNNs) are among the most powerful tools for image analysis. Thanks to convolutions, they achieve invariance with respect to translations.
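The translation property mentioned above can be checked numerically on a toy signal. The sketch below is a simplified 1-D illustration, not the method of the paper: it assumes circular (wrap-around) boundaries and a global max pooling step, and the helper name `circular_conv` is hypothetical. Under those assumptions, shifting the input shifts the convolution output by the same amount, and pooling over all positions then gives a shift-invariant value.

```python
import numpy as np


def circular_conv(x, kernel):
    """1-D circular convolution: y[n] = sum_j kernel[j] * x[(n - j) mod N]."""
    return sum(kernel[j] * np.roll(x, j) for j in range(len(kernel)))


rng = np.random.default_rng(0)
x = rng.normal(size=16)            # toy 1-D "image"
kernel = rng.normal(size=3)        # toy filter
shift = 5

y = circular_conv(x, kernel)
y_shifted = circular_conv(np.roll(x, shift), kernel)

# Shifting the input shifts the feature map by the same amount.
print(np.allclose(y_shifted, np.roll(y, shift)))

# Global max pooling over positions removes the shift entirely.
print(np.isclose(y.max(), y_shifted.max()))
```

The same check fails in general if the shift is replaced by a rotation of a 2-D image, which is the limitation that rotation-invariant architectures aim to address.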